Analyizing Jobs

This notebook will hold all analytics related code on this dataset. There are a list of questions I would like to answer and provide meaningfull visualization of the answers. Some questions are:

  • How many jobs are posted last month, last 3 months for big tech hubs like London, Amsterdam, Auston or San Francisco?
  • Same for countries
  • What are the best paying jobs for a given city?
  • What are the best paying technologies for a given city?
  • What are the best paying technologies globally?

In [209]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from unidecode import unidecode
import time 

from bokeh.charts import Bar, output_file, show
from bokeh.sampledata.autompg import autompg as df
from bokeh.io import output_notebook, show 

output_notebook()

jobs = pd.read_csv('../data/stackoverflow_jobs_enhanced.csv', thousands=',')
technologies = pd.read_csv('../data/technologies.csv')

# this is needed for excel export 
jobs.country = jobs.country.astype(str)
jobs.city = jobs.city.astype(str)

jobs['city']=jobs['city'].apply( lambda x:  unidecode(unicode(x, encoding = "utf-8")))  
jobs['country']=jobs['country'].apply( lambda x:  unidecode(unicode(x, encoding = "utf-8")))


Loading BokehJS ...

Top cities and countries posting jobs


In [210]:
top_cities = jobs.groupby(['city'])['jobid'].count().sort_values(ascending=False)
top_cities = top_cities.nlargest(15)

# create a data frame out of the series. 
# found this trick at stackoverflow: 
# http://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe
df_cities = pd.DataFrame({'count' : top_cities}).reset_index()

In [211]:
p = Bar(df_cities, values='count', title='Jobs posted by citites', legend='top_right', label=CatAttr(columns=['city'], sort=False),)

# Need to add different color for different cities 
show(p)



In [212]:
top_countries = jobs.groupby(['country'])['jobid'].count().sort_values(ascending=False)
top_countries = top_countries.nlargest(15)

df_countries = top_countries.to_frame('count').reset_index()
top_countries


Out[212]:
country
US             9830
Germany        4093
UK             2334
Netherlands    1226
Canada          702
nan             591
Australia       410
Sweden          345
Switzerland     339
France          328
Ireland         249
Spain           239
Finland         202
Austria         198
India           157
Name: jobid, dtype: int64

In [213]:
bar_countries = Bar(df_countries, values='count', title='Jobs posted by countries', legend='top_right', label=CatAttr(columns=['country'], sort=False),)

# Need to add different color for different cities 
show(bar_countries)


Top technologies for a given city (London, Amsterdam and San Francisco)


In [214]:
# London top technologies 
london_tech = technologies[technologies.city == 'London'].groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False)
london_tech.nlargest(10)


Out[214]:
city    tech               
London  javascript             294
        java                   276
        python                 195
        amazon-web-services    146
        c#                     136
        angularjs              106
        linux                  100
        php                     96
        tdd                     86
        html                    86
Name: jobid, dtype: int64

In [205]:
# Amsterdam top technologies 
amsterdam_tech = technologies[technologies.city == 'Amsterdam'].groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False)
amsterdam_tech.nlargest(10)


Out[205]:
city       tech               
Amsterdam  java                   122
           javascript             106
           php                     63
           python                  59
           mysql                   48
           c++                     39
           css                     37
           c#                      37
           html                    35
           amazon-web-services     34
Name: jobid, dtype: int64

In [206]:
# Berlin 
berlin_tech = technologies[technologies.city == 'Berlin'].groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False)
berlin_tech.nlargest(10)


Out[206]:
city    tech      
Berlin  java          364
        javascript    333
        php           174
        python        153
        mysql         152
        angularjs     125
        sql           118
        css           115
        linux          98
        html           97
Name: jobid, dtype: int64

In [207]:
# Silicon Valley 
cal_tech = technologies[technologies.state == 'CA'].groupby('tech')['jobid'].count().sort_values(ascending=False)
cal_tech.nlargest(10)


Out[207]:
tech
javascript             542
java                   537
python                 531
c++                    282
linux                  237
amazon-web-services    213
sql                    198
angularjs              177
c#                     167
ruby-on-rails          165
Name: jobid, dtype: int64

In [166]:
# Ordered view of what technologies are the most sought after in wich city 
all_tech = technologies.groupby(['city', 'tech'])['jobid'].count().sort_values(ascending=False)
amsterdam_tech.nlargest(100)


Out[166]:
city                tech               
New York            javascript             385
Berlin              java                   359
                    javascript             329
New York            python                 294
London              javascript             290
                    java                   273
New York            java                   227
San Francisco       python                 215
                    javascript             209
London              python                 194
San Francisco       java                   176
Berlin              php                    173
Seattle             java                   165
München             java                   158
Berlin              mysql                  152
                    python                 152
London              amazon-web-services    145
München             javascript             137
Rotterdam           php                    135
London              c#                     135
Toronto             javascript             132
                    java                   131
New York            node.js                129
Rotterdam           scrum                  128
New York            linux                  125
Berlin              angularjs              125
Rotterdam           zend-framework         125
                    tdd                    122
                    unit-testing           122
Amsterdam           java                   120
                                          ... 
London              sql                     78
Philadelphia        mobile                  77
Bangkok             java                    77
San Francisco       scala                   76
                    c++                     76
No office location  javascript              76
Washington          javascript              75
New York            postgresql              74
London              mysql                   74
Berlin              spring                  74
Chicago             c#                      73
Denver              video                   73
Washington          java                    72
Stockholm           java                    72
San Francisco       node.js                 72
                    reactjs                 71
New York            sysadmin                71
London              c++                     71
New York            php                     70
                    android                 70
San Francisco       go                      69
Hamburg             php                     68
San Francisco       angularjs               68
New York            reactjs                 67
Seattle             python                  67
Dublin              java                    66
Philadelphia        java                    65
Stockholm           javascript              65
Berlin              scala                   64
Seattle             c++                     63
Name: jobid, dtype: int64

Dumping out data to csv

This is only need to find values for cleaning up and normalizing values


In [26]:
cities.to_frame('city').to_csv('../data/cities.csv', encoding = 'utf-8')
countries.to_frame('countries').to_csv('../data/countries.csv', encoding = 'utf-8')

Writing it to Excel


In [168]:
ew = pd.ExcelWriter('../data/stackjobs.xlsx',options={'encoding':'utf-8'})
df_cities.to_excel(ew, 'City')
df_countries.to_excel(ew, 'Country')
ew.save()

In [ ]: